Advanced data techniques

Gerko Vink

Methodology & Statistics @ Utrecht University

10 Jun 2025

Disclaimer

I owe a debt of gratitude to many people as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.

These materials are generated by Gerko Vink, who holds the copyright. The intellectual property belongs to Utrecht University. Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.

Warning

You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:

  • You must ensure that the content is not used for further training of the model

Slide materials and source code

Materials

Anatomy of an Answer

Terms I may use

  • TDGM: True data generating model
  • DGP: Data generating process, closely related to the TDGM, but with all the wacky additional uncertainty
  • Truth: The comparative truth that we are interested in
  • Bias: The distance to the comparative truth
  • Variance: When not everything is the same
  • Estimate: Something that we calculate or guess
  • Estimand: The thing we aim to estimate or guess
  • Population: That larger entity without sampling variance
  • Sample: The smaller thing with sampling variance
  • Incomplete: There exists a more complete version, but we don’t have it
  • Observed: What we have
  • Unobserved: What we would also like to have

At the start

Let’s start with the core:

Statistical inference

Statistical inference is the process of drawing conclusions from truths

Truths are boring, but they are convenient.

  • However, for most problems truths require a lot of calculations, tallying or a complete census.
  • Therefore, a proxy of the truth is in most cases sufficient.
  • An example of such a proxy is a sample.
  • Samples are widely used and have been for a long time. (See Jelke Bethlehem’s CBS discussion paper for an overview of the history of sampling within survey statistics.)

Do we need data?

Without any data we can still come up with a statistically valid answer.

  • The answer will not be very informative.
  • In order for our answer to be more informative, we need more information

Some sources of information can already tremendously guide the precision of our answer.

In Short

Information bridges the answer to the truth. Too little information may lead you to a false truth.

Being wrong about the truth

  • The population is the truth
  • The sample comes from the population, but is generally smaller in size
  • This means that not all cases from the population can be in our sample
  • If not all information from the population is in the sample, then our sample may be wrong

Good questions to ask yourself

  1. Why is it important that our sample is not wrong?
  2. How do we know that our sample is not wrong?

Solving the missingness problem

  • There are many flavours of sampling
  • If we give every unit in the population the same probability to be sampled, we do random sampling
  • The convenience with random sampling is that the missingness problem can be ignored
  • The missingness problem would in this case be: not every unit in the population has been observed in the sample

Hmmm…

Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?

Sidestep

  • The problem is a bit larger

  • We have three entities at play, here:

    1. The truth we’re interested in
    2. The proxy that we have (e.g. sample)
    3. The model that we’re running
  • The more features we use, the more we capture about the outcome for the cases in the data

  • The more cases we have, the more we approach the true information


    All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.

Core assumption: all observations are bonafide

Uncertainty simplified

When we do not have all information …

  1. We need to accept that we are probably wrong
  2. We just have to quantify how wrong we are


In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.

The uncertainty measures about our estimates can be used to create intervals

Confidence in the answer

An intuitive approach to evaluating an answer is confidence. In statistics, we often use confidence intervals. Discussing confidence can be hugely informative!

If we draw 100 samples from a population and compute a 95% CI on each, then for a valid procedure we expect at least 95 of the 100 intervals to cover the true population value.

  • If the coverage <95: bad estimation process with risk of errors and invalid inference
  • If the coverage >95: inefficient estimation process, but correct conclusions and valid inference. Lower statistical power.

How do we know that our sample is not….

We can replicate our sample.

  • A replication would be a new sample from the same population or true data generating model obtained by the same data generating process.
  • If we would sample 100 times, we would get 100 different samples
  • If we would estimate 100 times, we would get 100 different estimates with 100 different confidence intervals (e.g. 95% CI)
  • Out of these 100 different intervals, we would expect a nominal coverage. For a 95% CI we’d expect 95 of them to cover the true population value.
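The replication argument above can be checked by simulation. The following is a minimal sketch in base R, assuming a normal population with a known mean; the values of `mu`, `sigma`, `n` and `nsim` are illustrative assumptions and do not come from the slides. It draws repeated samples, computes a 95% CI for the mean on each, and counts how often the interval covers the true value.

```r
set.seed(123)                      # assumed seed, for reproducibility only

mu    <- 10                        # assumed true population mean
sigma <- 2                         # assumed true population standard deviation
n     <- 50                        # assumed sample size
nsim  <- 1000                      # number of replications

covers <- replicate(nsim, {
  x  <- rnorm(n, mean = mu, sd = sigma)        # a new sample from the same process
  ci <- t.test(x, conf.level = 0.95)$conf.int  # 95% CI for the mean
  ci[1] <= mu && mu <= ci[2]                   # does this interval cover the truth?
})

coverage <- mean(covers)           # proportion of intervals that cover mu
coverage
```

With a valid procedure the resulting proportion should be close to the nominal 0.95. Values clearly below that point to an estimation process that yields invalid inference.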

This is a lot of work…

Full sampling validation of a model’s inferences is a lot of work.

  • it is the most robust way of obtaining inferential validity
  • it is not always necessary

Under some general assumptions, we can use the same data to validate our model’s inferences and predictions.

  • these assumptions can be met in practice
  • but as soon as assumptions are made, we open the door to errors when these assumptions do not hold

Assumptions

Take the following definition:

a thing that is accepted as true or as certain to happen, without proof.

Assumptions are a statistician’s faith. It is often impossible to prove that they hold in practice, but we choose to believe that they do.

Sensitivity analyses

I often use computational evaluation techniques to quantify the impact of the assumptions made. For example, we can test the effect of violating assumptions on our results. We then verify whether the inferences are sensitive to violations of the assumptions. We can even determine the extent to which an assumption can be violated before it starts to influence our inferences.

The holy trinity

Whenever I evaluate something, I tend to look at three things:

  • bias (how far from the truth)
  • uncertainty/variance (how wide is my interval)
  • coverage (how often do I cover the truth with my interval)


As a function of model complexity in specific modeling efforts, these components play a role in the bias/variance tradeoff

On the individual level

Individual intervals can also be hugely informative!

Individual intervals are generally wider than confidence intervals

  • This is because they cover the inherent uncertainty in the individual data point on top of the sampling uncertainty
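The difference in width can be made concrete with a small sketch in base R. The data below are simulated under assumed values (a linear model with slope 0.5 and residual standard deviation 1); they are only an illustration. `predict()` on an `lm` fit returns a confidence interval for the mean response or a prediction interval for a single new observation, depending on the `interval` argument.

```r
set.seed(1)                                    # assumed seed, for reproducibility only
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)            # hypothetical data generating model

fit <- lm(y ~ x)
new <- data.frame(x = 5)                       # the point at which we want an interval

# Interval for the mean response at x = 5 (sampling uncertainty only)
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.95)
# Interval for one new individual at x = 5 (adds the residual variation)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

width <- function(int) unname(int[, "upr"] - int[, "lwr"])
c(confidence = width(conf_int), prediction = width(pred_int))
```

The prediction interval is always the wider of the two, because it adds the residual variance to the variance of the estimated mean.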

Be careful

Narrower intervals mean less uncertainty.

It does not mean that the answer is correct!

Case: Space Shuttle Challenger

On 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the space shuttle Challenger experienced an enormous fireball caused by one of its two booster rockets and broke up. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.

Nothing happened, so we ignored it

In the decision to proceed with the launch, there was a presence of dark data. And no-one noticed!

Dark data
Information that is not available but necessary to arrive at the correct answer.

This missing information has the potential to mislead people. The notion that we can be misled is essential because it also implies that artificial intelligence can be misled!

If you don’t have all the information, there is always the possibility of drawing an incorrect conclusion or making a wrong decision.

In Practice

We now have a new problem:

  • we do not have the whole truth; but merely a sample of the truth
  • we do not even have the whole sample, but merely a sample of the sample of the truth.

What would be a simple solution to allowing for valid inferences on the incomplete sample? Would that solution work in practice?

How to fix the missingness problem

There are two sources of uncertainty that we need to cover when analyzing incomplete data:

  1. Uncertainty about the data values we don’t have:
    when we don’t know what the true observed value should be, we must create a distribution of values with proper variance (uncertainty).
  2. Uncertainty about the process that generated the values we do have:
    nothing can guarantee that our sample is the one true sample. So it is reasonable to assume that the parameters obtained on our sample are biased.

A straightforward and intuitive solution for analyzing incomplete data in such scenarios is multiple imputation (Rubin, 1987).

Now how do we know we did well?

I’m really sorry!

In practice we don’t know if we did well, because we often lack the necessary comparative truths.

For example:

  1. Predicting a future response when we only have the past
  2. Analyzing incomplete data without a reference about the truth
  3. Estimating the effect between two things that can never occur together
  4. Detecting fraudulent transactions with access only to one’s own transaction history
  5. Appealing to a new customer base with only data about existing customers
  6. Mixing bonafide observations with bonafide non-observations

Case 1: How to evaluate without a truth?

Scenario

Let’s assume that we have an incomplete data set and that we can impute (fill in) the incomplete values under multiple candidate models

Challenge
Imputing this data set under one model may yield different results than imputing this data set under another model. Identify the best model!

Problem
We have no idea about the validity of either model’s results: we would need either the true observed values or the estimand before we can judge the performance and validity of the imputation model.

Not all is lost

We do have a constant in our problem: the observed values

Solution - overimpute the observed values

Case 2: When you suspect your model is wrong

Scenario

  • 1236 citizens of Leiden who were 85 years or older on December 1, 1986 (Lagaay, Van der Meij, and Hijmans 1992).
  • Visited by a physician between January 1987 and May 1989.
  • A full medical history, information on current use of drugs, a venous blood sample, and other health-related data were obtained.
  • BP was routinely measured during the visit.
    • Apart from some individuals who were bedridden, BP was measured while seated.
    • An Hg manometer was used and BP was rounded to the nearest 5 mmHg.
  • The mortality status of each individual on March 1, 1994 was retrieved from administrative sources.

Problem

  • BP was measured less frequently for very old persons and for persons with health problems.
  • BP was measured more often if the BP was too high, for example if the respondent indicated a previous diagnosis of hypertension, or if the respondent used any medication against hypertension.
  • The missing data rate of BP also varied during the period of data collection.
    • The rate gradually increases during the first seven months of the sampling period from 5 to 40 percent of the cases, and then suddenly drops to a fairly constant level of 10–15 percent.
    • A complicating factor here is that the sequence in which the respondents were interviewed was not random.
    • High-risk groups, that is, elderly in hospitals and nursing homes and those over 95, were visited first.

Survival rate

Missingness

Sensitivity analysis

Case 3: When you know you’re wrong

Scenario

In a survey about research integrity and fraud we surveyed behaviours and practices in the following format.


Many behaviours were surveyed over multiple groups of people. Some findings:

  • In most groups similar behavioural prevalence was observed.
  • When looking at subgroups, prevalences differ between subgroups.
  • Not applicables were much more prevalent in one group than in other groups
  • There are too few cases and too many patterns of Not Applicables over features to allow for a pattern-wise (stratified) analysis.
  • There are too many Not Applicables to allow for listwise deletion.

Some background

We know:

  1. Not Applicable is not randomly distributed over the data. Removing them is therefore not valid!
  2. Not Applicables are bonafide missing values: there should be no observations.

There’s no such thing as a free lunch

Every imputation will bias the results. For some we know the direction of the bias, for some we have no idea. We do not have access to the truth.

What would you do?

Our solution

We chose to impute the data as 1 (never). There are a couple of reasons why we think that this is the most defensible scenario.

  1. Never has a semantic similarity to a behaviour not being applicable. However, Never implies intentionality; Not Applicable does not.
  2. We know the effect the imputation has on the inference: Filling in Never will underestimate intentional behaviours.

In this case the choice was made to make a deliberate error. The estimates obtained would serve as an underestimation of true behaviour and can be considered a lower bound estimation.

Sometimes, by sheer luck..

Dark data

Clichés out of the way!

Everything is a missing data problem

All models are wrong, but some are useful

How wrong can a model be to still be useful?

Topics for this lecture

  • Problem of dark data
  • Strategies to deal with missing data
  • Multiple imputation methodology to analyse incomplete data
  • Synthetic data sets for disclosure protection

What is dark data?

Dark data are concealed from us, and that very fact means we are at risk of misunderstanding, of drawing incorrect conclusions, and of making poor decisions.

Dark data types

No   Description
1    Data We Know Are Missing
2    Data We Don’t Know Are Missing
3    Choosing Just Some Cases
4    Self-Selection
5    Missing What Matters
6    Data Which Might Have Been
7    Changes with Time
8    Definitions of Data
9    Summaries of Data
10   Measurement Error and Uncertainty
11   Feedback and Gaming
12   Information Asymmetry
13   Intentionally Darkened Data
14   Fabricated and Synthetic Data
15   Extrapolating beyond Your Data
16   Data not yet observable

Concepts: Definition

  • Missing values are those values that are not observed
  • Values do exist in theory, but we are unable to see them

Concepts: Reasons

Missing or dark data can occur for a lot of reasons. Or for no reason at all. For example

  • Intentional: Sample, predict, combine, estimate
    • routing
    • experimental design
    • join, merge and bind operations
  • Unintentional:
    • dropout, refusal, concealed
    • too far away, too small to observe
    • power failure, budget exhausted, bad luck

Consequences: Why are missing values problematic?

  • Cannot calculate, not even the mean
  • Less information than planned
  • Enough statistical power?
  • Different analyses, different \(n\)’s
  • Systematic biases in the analysis
  • Appropriate confidence interval, \(p\)-values?

Missing data can severely complicate interpretation and analysis

Notation: \(Y\), \(R\), \(X\)


  • \(Y\) random variable with missing data
  • \(Y^\mathrm{obs}\) true and observed values of \(Y\)
  • \(Y^\mathrm{mis}\) true but unobserved values of \(Y\)


  • \(X\) complete covariate


  • \(R\) response indicator
  • \(R = 1\) if \(Y\) is observed
  • \(R = 0\) if \(Y\) is missing
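In code, this notation maps directly onto a few vectors. A minimal sketch in base R, using made-up values for `y` and `x` (both hypothetical):

```r
y <- c(5.1, NA, 4.7, NA, 6.3)   # hypothetical incomplete variable Y
x <- c(1, 2, 3, 4, 5)           # hypothetical complete covariate X

r     <- as.integer(!is.na(y))  # response indicator R: 1 = observed, 0 = missing
y_obs <- y[r == 1]              # Y^obs: the values we have
n_mis <- sum(r == 0)            # how many values belong to Y^mis (their values are unknown)

data.frame(x = x, y = y, r = r)
```

Note that `y_obs` can be extracted from the data, whereas the values in \(Y^\mathrm{mis}\) cannot: only their positions are known through `r`.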

Types of distributions

  • Marginal distribution \(P(Y)\)
    • frequency distribution/histogram of \(Y\)
    • normal distribution with mean \(\mu\) and variance \(\sigma^2\)
  • Joint distribution \(P(Y, X)\)
    • contingency table/scatterplot of \(Y\) and \(X\)
    • bivariate normal distribution
  • Conditional distribution \(P(Y | X)\)
    • distribution of \(Y\) at a given value of \(X\)
    • regression model with normally distributed errors

The complete-data model

  • The model we would like to fit if we had complete data
  • The model of scientific interest
  • Examples:
    • \(P(Y | X, \theta)\): Predict blood pressure \(Y\) from health \(X\)
    • \(P(\theta | Y)\): Estimate gross domestic product \(\theta\) from production \(Y\)

The missing data model (mechanism)

  • The model that explains what is observed
  • Often not of direct scientific interest
  • Examples:
    • \(P(R | Y, X, \psi)\): Missingness depends on design covariates \(X\)
    • \(P(R | Y, \psi)\): Missingness depends on incomplete \(Y\)

Missing data mechanisms

Missing data mechanism: A key assumption

  • We assume we know where the missing data are

  • Cases where the assumption does not hold:

    • “Tick any of the following” (we don’t know which values are real)
    • Truncated data (we don’t know how many values are missing)

Missing data mechanism: Definition

  • Process that governs which \(Y\)s are observed and which \(Y\)s are unobserved (Rubin, 1976)
  • Sometimes we know this process (e.g. experimental design, sampling)
  • Model by response probability \(P(R | Y^\mathrm{obs}, Y^\mathrm{mis}, X)\)
  • Also called missing data model

MCAR: Missing Completely at Random

  • Probability to be missing is not related to any data

\[ P(R|Y^\mathrm{obs}, Y^\mathrm{mis}, X, \psi) = P(R|\psi) \]

  • Examples
    • data transmission error
    • random sample

David Hand calls this mechanism Not Data Dependent

MAR: Missing at Random

  • Probability to be missing depends on known data

\[ P(R|Y^\mathrm{obs}, Y^\mathrm{mis}, X, \psi) = P(R|Y^\mathrm{obs}, X, \psi) \]

  • Examples
    • Income, where we have \(X\) related to wealth
    • Branch patterns (e.g. how old are your children?)

David Hand calls this mechanism Seen Data Dependent

MNAR: Missing Not at Random

  • Probability to be missing depends on unknown data

\[ P(R|Y^\mathrm{obs}, Y^\mathrm{mis}, X, \psi) \]

does not simplify

  • Examples
    • Income, without covariates related to income
    • Body weight report

David Hand calls this mechanism Unseen Data Dependent

Missing data mechanisms: roundup

  • Missing Completely at Random (MCAR)
    • missingness is purely random
    • relatively easy to deal with
  • Missing at Random (MAR)
    • missingness related to observed information
    • widely used for principled analysis
  • Missing Not at Random (MNAR)
    • missingness related to unobserved information
    • cannot detect this from the data
    • difficult to deal with, need context information
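The three mechanisms can be simulated to see their effect on a simple complete-case statistic. This is a minimal sketch in base R under assumed values: the correlation between `x` and `y`, the logistic response models and the sample size are illustrative choices, not taken from the slides.

```r
set.seed(2025)                                 # assumed seed, for reproducibility only
n <- 10000
x <- rnorm(n)                                  # complete covariate X
y <- 0.7 * x + rnorm(n, sd = sqrt(0.51))       # incomplete variable Y, true mean 0

# Response probabilities P(R = 1) under each mechanism (illustrative choices)
p_mcar <- rep(0.5, n)                          # constant: unrelated to any data
p_mar  <- plogis(1.5 * x)                      # depends on the observed x only
p_mnar <- plogis(1.5 * y)                      # depends on y itself

# Mean of y among the cases that happen to be observed
cc_mean <- function(p) {
  r <- rbinom(n, size = 1, prob = p) == 1      # response indicator R
  mean(y[r])
}

means <- c(truth = mean(y),
           MCAR  = cc_mean(p_mcar),
           MAR   = cc_mean(p_mar),
           MNAR  = cc_mean(p_mnar))
round(means, 2)
```

In this setup the complete-case mean stays close to the truth only under MCAR. Under MAR the bias can be addressed by conditioning on `x`; under MNAR the observed data alone do not reveal it.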

Missing data mechanisms: Graphical representation

Missing data mechanisms: Alternative terminology

  • Not Data Dependent (~MCAR)
    • It’s missing for reasons unrelated to the data
    • Probability to be missing is constant for all units
    • E.g. some students not sitting an exam due to flu symptoms
  • Seen Data Dependent (~MAR)
    • It’s missing for reasons related to data you have got
    • Probability to be missing depends on observed data
    • E.g. a school discouraging lower performing students from sitting the exam
  • Unseen Data Dependent (~MNAR)
    • Missing because of the values you would have obtained
    • Probability to be missing depends on unobserved data
    • E.g. students realized they had revised the wrong material, so didn’t sit the exam

Uwe Aickelin: What to do with the missing data?

https://www.youtube.com/watch?v=oCQbC818KKU

Lessons from Uwe Aickelin

  • You could obtain the data, but it’s not there
  • Quality of data is going down - big data
  • Why not go back to expert? Impractical
  • Why not delete? What to delete?
  • Reasons for missing data are important
  • Missing (Completely?) at Random
  • How impute? Mean, random, mean per group
  • Software cannot handle missing data
  • Forced internet surveys

Strategies to deal with missing data

Strategies

  1. Prevention
  2. Ad-hoc methods, e.g., single imputation, complete cases
  3. Weighting methods
  4. Likelihood methods, EM-algorithm
  5. Multiple imputation

Prevent unintended missing data

1. Prevention strategies

  • Design: Time intervals, Number of variables, Pilot study
  • Collection: Incentives, Match interviewer-respondent, Quick follow-up, Retrieve missing data
  • Measures: Use short forms, Minimize intrusive measures, Clarity, Layout
  • Treatment: Minimize burden and intensity

Ad-hoc methods make strong assumptions

2. Ad-hoc strategies

  • Listwise deletion
  • Mean imputation
  • Regression imputation
  • Stochastic regression imputation
  • Last observation carried forward (LOCF)
  • Indicator method

Listwise deletion

  • Analyze only the complete records
  • Also known as: complete-case analysis
  • Advantages
    • Simple (default in most software)
    • Unbiased under MCAR
    • Conservative standard errors, significance levels
    • Two special properties in regression

Listwise deletion

  • Disadvantages
    • Wasteful
    • May not be possible
    • Larger standard errors
    • Biased under MAR, even for simple statistics like the mean
    • Inconsistencies in reporting

Listwise deletion: Special properties

  • For any regression with missing data in the predictors, estimates under listwise deletion are unbiased as long as the missingness does not depend on the outcome. Even some MNAR cases (Glynn 1986; Little 1992).
  • In logistic regression only: With missing data in either the outcome \(Y\) or the predictors \(X\) (but not both), estimates of regression weights (but not the intercept) after listwise deletion are unbiased as long as the missingness depends only on \(Y\) (and not on \(X\)!) (Vach 1994). This property is widely exploited in case-control studies in epidemiology.
  • See FIMD 2.7: https://stefvanbuuren.name/fimd/sec-when.html

Mean imputation

  • Replace the missing values by the mean of the observed data
  • Advantages
    • Simple
    • Unbiased for the mean, under MCAR

Mean imputation

  • Disadvantages
    • Disturbs the distribution
    • Underestimates the variance
    • Biases correlations to zero
    • Biased under MAR
  • AVOID (unless you know what you are doing)
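The disadvantages listed above are easy to reproduce. A minimal sketch in base R, assuming simulated data with a true correlation of 0.6 and 50% of `y` made missing completely at random (all values are illustrative assumptions):

```r
set.seed(42)                                   # assumed seed, for reproducibility only
n <- 1000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)              # hypothetical data: var(y) = 1, cor = 0.6

y_inc <- y
y_inc[sample(n, size = n / 2)] <- NA           # make half of y missing (MCAR)

# Mean imputation: replace every missing value by the observed mean
y_imp <- ifelse(is.na(y_inc), mean(y_inc, na.rm = TRUE), y_inc)

rbind(complete     = c(mean = mean(y),     var = var(y),     cor = cor(x, y)),
      mean_imputed = c(mean = mean(y_imp), var = var(y_imp), cor = cor(x, y_imp)))
```

The mean survives, but the variance shrinks (half of the values now sit exactly on the mean) and the correlation with `x` is pulled toward zero.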

Regression imputation

  • Also known as prediction
    • Fit model for \(Y^\mathrm{obs}\) under listwise deletion
    • Predict \(Y^\mathrm{mis}\) for records with missing \(Y\)s
    • Replace missing values by prediction
  • Advantages
    • Under MAR, unbiased estimates of regression coefficients
    • Good approximation to the (unknown) true data if explained variance is high
  • Favourite among data scientists and machine learners

Regression imputation

  • Disadvantages
    • Artificially increases correlations
    • Systematically underestimates the variance
    • Too optimistic \(p\)-values and too short confidence intervals
  • AVOID. Harmful to statistical inference

Stochastic regression imputation

  • Like regression imputation, but adds appropriate noise to the predictions to reflect uncertainty
  • Advantages
    • Preserves the distribution of \(Y^\mathrm{obs}\)
    • Preserves the correlation between \(Y\) and \(X\) in the imputed data

Stochastic regression imputation

  • Disadvantages
    • The assumption of symmetric errors with constant variance is restrictive
    • Single imputation: does not take the uncertainty about the imputed data into account, and incorrectly treats them as real
    • Not so simple anymore
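Regression imputation and its stochastic variant can be compared side by side. The sketch below uses base R on simulated data; the data generating model and the MAR response model are assumptions made for illustration. The noise is drawn from a normal distribution with the residual standard deviation of the fitted model, which is one simple way to add "appropriate noise" and is itself an assumption.

```r
set.seed(7)                                    # assumed seed, for reproducibility only
n <- 1000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)              # hypothetical data: cor(x, y) = 0.6
r <- rbinom(n, 1, plogis(x)) == 1              # MAR: response depends on x only

dat  <- data.frame(x = x, y = ifelse(r, y, NA))
fit  <- lm(y ~ x, data = dat)                  # lm() drops incomplete rows by default
pred <- predict(fit, newdata = dat)            # predictions for all rows

y_reg <- ifelse(r, y, pred)                                   # regression imputation
y_sto <- ifelse(r, y, pred + rnorm(n, sd = sigma(fit)))       # stochastic regression imputation

rbind(truth      = c(var = var(y),     cor = cor(x, y)),
      regression = c(var = var(y_reg), cor = cor(x, y_reg)),
      stochastic = c(var = var(y_sto), cor = cor(x, y_sto)))
```

Regression imputation places every imputed value exactly on the regression line, which inflates the correlation and shrinks the variance. Adding noise restores both, although a single stochastic imputation still treats the imputed values as if they were observed.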

Overview of assumptions needed

Method       Unbiased mean   Unbiased reg. weight   Unbiased correlation   Standard error
Listwise     MCAR            MCAR                   MCAR                   Too large
Pairwise     MCAR            MCAR                   MCAR                   Complicated
Mean         MCAR            -                      -                      Too small
Regression   MAR             MAR                    -                      Too small
Stochastic   MAR             MAR                    MAR                    Too small
LOCF         -               -                      -                      Too small
Indicator    -               -                      -                      Too small

Weighting minimizes bias with unit nonresponse

3. Weighting

  • Take the complete cases
  • Re-weight any statistic to the distribution of the covariates in the population
  • Advantages
    • Simple (one set of weights for all incomplete variables)
    • Reduces bias under MAR assumption
    • Standard methodology in official statistics
  • Disadvantages
    • Discards data, increases the variance
    • Weights may not be available
    • Needs special variance estimators
    • Limited to unit non-response

For inference purposes, proper imputation strategies quickly prove to be more efficient and more accurate than weighting strategies (Boeschoten et al., 2017).

Maximum likelihood: The royal road to missing data

4. Maximum likelihood

  • EM-algorithm, Full Information Maximum Likelihood (FIML)
  • Iterative methods to estimate parameters that “skip over” the missing data
  • Advantages:
    • Theoretically sound, optimizes likelihood calculation directly
    • Many applications, widely accepted
    • Easy to apply (when there is software)
  • Disadvantages:
    • Local optima, slow convergence
    • Difficult to apply outside standard models
    • Complete-data model becomes large and complex

Multiple imputation is an all-round principled method

5. Multiple imputation

  • One imputation cannot be correct in general
  • Imputes each missing value \(m\) times
  • Variation between the \(m\) imputed values reflects our ignorance about the unknown value

Multiple imputation workflow

Multiple imputation - 1987

Multiple imputation

  • Advantages
    • Correct point and variance estimates
    • Splits missing data problem from complete-data analysis
    • Theoretical properties well established
    • Flexible, widely applicable
    • Extensible to MNAR
  • Disadvantages
    • Need to create and work with multiple imputed data sets
    • May not always be most efficient

What is the goal of multiple imputation?

The goal:

  • IS NOT to find the correct value for a missing data point
  • IS to find an answer to the analysis problem, given that there are (many) data points missing.

We are not interested in whether the imputed value corresponds to its true counterpart in the population. Rather, we sample plausible values that could have been observed from the posterior predictive distribution.

Demonstration of imputation

Let our analysis model be

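library(mice)     # provides the boys data, mice() and pool()
library(magrittr) # provides the %$% and %>% pipes
boys %$% # use the exposition pipe
  lm(hgt ~ age + tv)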

Demonstration of imputation

with output

boys %$% 
  lm(hgt ~ age + tv) %>% 
  summary()

Call:
lm(formula = hgt ~ age + tv)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.679  -5.134  -0.398   5.175  23.778 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 105.4823     3.4704  30.395  < 2e-16 ***
age           3.8430     0.3262  11.782  < 2e-16 ***
tv            0.4919     0.1278   3.849 0.000155 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.389 on 221 degrees of freedom
  (524 observations deleted due to missingness)
Multiple R-squared:  0.7742,    Adjusted R-squared:  0.7721 
F-statistic: 378.8 on 2 and 221 DF,  p-value: < 2.2e-16

Demonstration of imputation

This output was generated on the 224 complete cases only. The full data size is

boys %>% dim()
[1] 748   9

Demonstration of imputation

To impute and analyze the same model with mice, we can simply run:

library(purrr) # provides map()
boys %>% 
  mice(m = 5, method = "cart", printFlag = FALSE) %>% 
  complete("all") %>% 
  map(~.x %$% lm(hgt ~ age + tv)) %>% 
  pool() %>% 
  summary()
         term   estimate  std.error  statistic       df      p.value
1 (Intercept) 71.5467019 0.61617119 116.114975 735.4492 0.000000e+00
2         age  7.0475726 0.09475359  74.377898  75.9456 1.043045e-72
3          tv -0.5577935 0.09163996  -6.086793  39.9114 3.598196e-07

What have we done?

We have used mice to obtain draws from a posterior predictive distribution of the missing data, conditional on the observed data.

The imputed values are mimicking the sampling variation and can be used to infer about the underlying TDGM, if and only if:

  • The observed data holds the information about the missing data (MAR/MCAR)

Synthetic data generation

Imputation vs Synthetisation

Instead of drawing only imputations from the posterior predictive distribution, we might as well overimpute the observed data.

How to draw synthetic data sets with mice

boys %>% 
  mice(m = 5, method = "cart", printFlag = FALSE, where = matrix(TRUE, 748, 9)) %>% 
  complete("all") %>% 
  map(~.x %$% lm(hgt ~ age + tv)) %>% 
  pool() %>% 
  summary()
         term   estimate std.error statistic       df      p.value
1 (Intercept) 71.4727297 0.7637067 93.586620 71.79955 8.889816e-77
2         age  6.8882608 0.1210342 56.911699 21.81539 3.232474e-25
3          tv -0.4038602 0.1028084 -3.928281 33.89849 3.989687e-04

But we make an error!

Pooling in imputation

Rubin (1987, p. 76) defined the following rules:

For any number of multiple imputations \(m\), the combination of the analysis results for any estimate \(\hat{Q}\) of estimand \(Q\) with corresponding variance \(U\), can be done in terms of the average of the \(m\) complete-data estimates

\[\bar{Q} = \sum_{l=1}^{m}\hat{Q}_l / m,\]

and the corresponding average of the \(m\) complete data variances

\[\bar{U} = \sum_{l=1}^{m}{U}_l / m.\]

Pooling in imputation

Simply using \(\bar{Q}\) and \(\bar{U}\) to obtain our inferences would be too simplistic. In that case we would ignore any possible variation between the separate \(\hat{Q}_l\) and the fact that we only generate a finite number of imputations \(m\). Rubin (1987, p. 76) established that the total variance \(T\) of \((Q-\bar{Q})\) would equal

\[T = \bar{U} + B + B/m,\]

where the between imputation variance \(B\) is defined as

\[B = \sum_{l=1}^{m}(\hat{Q}_l - \bar{Q})^\prime(\hat{Q}_l - \bar{Q}) / (m-1)\]

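This rule assumes that some of the data are observed and remain constant over the imputed sets. When every value is overimputed, as in the synthetic sets above, that assumption no longer holds.

The total variance \(T\) of \((Q-\bar{Q})\) should then (Reiter, 2003) equal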

\[T = \bar{U} + B/m.\]

So, the correct code is

boys %>% 
  mice(m = 5, method = "cart", printFlag = FALSE, where = matrix(TRUE, 748, 9)) %>% 
  complete("all") %>% 
  map(~.x %$% lm(hgt ~ age + tv)) %>% 
  pool(rule = "reiter2003") %>% 
  summary()
         term   estimate  std.error  statistic         df       p.value
1 (Intercept) 71.3427191 0.67872263 105.113218 4670.20788  0.000000e+00
2         age  6.9161820 0.09961066  69.432144  178.28359 5.318364e-131
3          tv -0.4015052 0.09691269  -4.142958   56.62364  1.155999e-04
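The two pooling rules differ only in how the between variance enters the total variance. A minimal sketch in base R for a single scalar estimate: `pool_scalar()` is a hypothetical helper written for this illustration (it is not part of mice), and the estimates `q` and variances `u` are made-up numbers.

```r
# Pool m scalar estimates q with complete-data variances u.
# rule = "rubin1987":  T = Ubar + B + B/m
# rule = "reiter2003": T = Ubar + B/m
pool_scalar <- function(q, u, rule = c("rubin1987", "reiter2003")) {
  rule  <- match.arg(rule)
  m     <- length(q)
  qbar  <- mean(q)                 # pooled estimate
  ubar  <- mean(u)                 # average within variance
  b     <- var(q)                  # between variance, divisor m - 1
  total <- if (rule == "rubin1987") ubar + b + b / m else ubar + b / m
  c(estimate = qbar, ubar = ubar, b = b, t = total, se = sqrt(total))
}

q <- c(7.0, 7.1, 6.9, 7.2, 6.8)    # hypothetical estimates from m = 5 data sets
u <- rep(0.01, 5)                  # hypothetical complete-data variances

rubin  <- pool_scalar(q, u, rule = "rubin1987")
reiter <- pool_scalar(q, u, rule = "reiter2003")
rbind(rubin, reiter)
```

For these numbers \(\bar{U} = 0.01\) and \(B = 0.025\), so the first rule gives \(T = 0.04\) and the second gives \(T = 0.015\). The pooled estimate is identical under both rules; only the standard error changes.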

Why multiple synthetic sets?

Think back to the goal of statistical inference: we want to go back to the true data generating model.

  1. We do so by reverse engineering the true data generating process
  2. Based on our observed data
  3. We do not know this process; hence multiple synthetic values

The multiplicity of the solution allows for smoothing over any Monte Carlo error that may arise from generating a single set.

Generating more synthetic data

mira <- boys %>% 
  mice(m = 6, method = "cart", printFlag = FALSE, where = matrix(TRUE, 748, 9)) %>% 
  list('1' = rbind(complete(., 1), complete(., 2)),
       '2' = rbind(complete(., 3), complete(., 4)),
       '3' = rbind(complete(., 5), complete(., 6))) %>% .[-1] %>% 
  data.table::setattr("class", c("mild", class(.))) %>% 
  map(~.x %$% lm(hgt ~ reg))

mira %>% pool(rule = "reiter2003") %>% 
  summary() %>% tibble::column_to_rownames("term") %>% round(3)
            estimate std.error statistic      df p.value
(Intercept)  152.014     3.746    40.582 112.526       0
regeast      -17.815     4.461    -3.993 771.076       0
regwest      -23.092     4.397    -5.252  95.353       0
regsouth     -27.579     4.451    -6.196 177.098       0
regcity      -25.928     6.111    -4.243  19.899       0
mira %>% pool(rule = "reiter2003", 
              custom.t = ".data$ubar * 2 + .data$b / .data$m") %>% 
  summary() %>% tibble::column_to_rownames("term") %>% round(3)
            estimate std.error statistic      df p.value
(Intercept)  152.014     5.118    29.703 112.526   0.000
regeast      -17.815     6.228    -2.860 771.076   0.004
regwest      -23.092     5.989    -3.856  95.353   0.000
regsouth     -27.579     6.125    -4.503 177.098   0.000
regcity      -25.928     7.928    -3.270  19.899   0.004

Some adjustment to the pooling rules is needed to avoid p-inflation.

Some care is needed

With synthetic data generation and synthetic data implementation come some risks.

Any idea?

What should synthetic data be?

Testing validity

Nowadays many synthetic data cowboys claim that they can generate synthetic data that looks like the real data that served as input.

This is like going to Madame Tussauds: at face value it looks identical, but when experienced in real life it’s just not the same as the living thing.

Many of these synthetic data packages only focus on marginal or conditional distributions. With mice we also consider the inferential properties of the synthetic data.

In general, we argue that any synthetic data generation procedure should

  1. Preserve marginal distributions
  2. Preserve conditional distributions
  3. Yield valid inference
  4. Yield synthetic data that are indistinguishable from the real data

Example from simulation

When valid synthetic data are generated, the variance of the estimates is correct, such that the confidence intervals cover the population (i.e. true) value sufficiently. Take e.g. the following proportional odds model from Volker & Vink (2021):

term       estimate   synthetic bias   synthetic cov
age           0.461            0.002           0.939
hc           -0.188           -0.004           0.945
regeast      -0.339            0.092           0.957
regwest       0.486           -0.122           0.944
regsouth      0.646           -0.152           0.943
regcity      -0.069            0.001           0.972
G1|G2        -6.322           -0.254           0.946
G2|G3        -4.501           -0.246           0.945
G3|G4        -3.842           -0.244           0.948
G4|G5        -2.639           -0.253           0.947

End of presentation

A. Bacall